ABSTRACT

Pressed  by  the  need  to  improve  oil  analysis  performance,  some  equipment  operators  have 
increased  sampling  frequency  (shortened  intervals)  in  order  to  increase  the  probability  of  early 
fault detection. As a consequence, laboratory labor costs increased considerably, quadrupling in
some  cases.  Over  the  past  10  years,  expert  systems  have  been  increasingly  used  to  compensate 
for  the  increased  processing  time  by  automatically  interpreting  sample  data  in  near  real  time, 
improving  evaluation  reliability  and  minimizing  the  associated  labor  costs. 

A properly  designed  expert  system  can  quickly  review  all  recorded  equipment  and  sample 
data,  while  keeping  the  analysis  time  and  costs  within  acceptable  levels.  These  systems  greatly 
increase data interpretation consistency and can generate significant returns on investment.
This  paper  presents  an  overview  of  several  of  these  systems  and  the  general  principles  utilized  in 
their  development. 


KEY WORDS: Expert system; fault detection; fault library; knowledge based systems; oil
condition  monitoring;  railway  systems;  statistical  process  control;  used-oil  analysis 

INTRODUCTION 

Used-oil  analysis,  pioneered  by  American  railways  and  the  Department  of  Defense,  has  led 
to  significant  reductions  in  unexpected  equipment  failure  and  has  increased  equipment  reliability 
and safety of operation (1). The traditional used-oil analysis process is based on frequent periodic
samples,  simple-to-use  test  methods  and  trained  oil  analysts  to  evaluate  findings  and  advise 
maintenance  personnel  of  a required  action.  The  high  sampling  frequency  required  for  reliable 
monitoring  of  diesel  powered  (<  250  hours)  and  gas  turbine  powered  (<  25  hours)  equipment 
requires  substantial  labor  (and  costs)  for  testing  and  data  interpretation.  The  greater  the  sample 
frequency  or  the  number  of  tests  performed  on  each  sample,  the  greater  the  costs  involved.  In 
addition,  reliable  interpretation  of  machinery  and  fluid  condition  requires  assessment  of  many 
other types of data, including oil related failure mechanisms, their symptoms, effects and costs;
machine  configuration  changes  and  utilization;  scheduled  and  completed  maintenance;  etc. 
Requiring  a review  of  all  possible  data  for  every  sample  is  very  labor  intensive  and  costly.  As  a 
consequence,  the  scope  of  an  oil  analysis  program  is  often  limited  to  fit  the  available  resources. 
The  usual  result  is  longer  sample  intervals,  less  analysis  reliability,  fewer  quantifiable  benefits  and 
less  confidence  in  the  program. 

Pressed  by  the  need  to  improve  locomotive  reliability  after  the  recession  of  1981/82,  the 
Canadian  Pacific  Railway  (CP)  conducted  an  evaluation  of  locomotive  engine  failure  modes,  their 
effects  and  costs®.  One  of  the  outcomes  of  this  evaluation  was  the  conclusion  that  the  used-oil 
analysis  program  did  not  always  indicate  the  occurrence  of  an  oil  related  engine  failure  mode  prior 
to  failure.  In  fact,  agreement  between  laboratory  recommendations  and  subsequent  inspection  or 
failure  reports  indicated  the  laboratory  reliability  to  be  about  65%.  At  the  time,  CP,  as  with  most 
railways, obtained engine oil samples during Federal Government mandated inspections at either
46 or 92 day intervals. These intervals were very convenient but failed to respect failure mode
duration or timing. Oil analysis at such long sample intervals identified some faults before failure,
although  sometimes,  engine  failure  was  the  first  sign  of  trouble.  Engine  reliability  was  not 
considered  a serious  problem  as  sophisticated  preventive  maintenance  (PM)  procedures  kept 
failure  rate  down.  However,  with  large  fleets  of  1000  plus  locomotives,  even  a “low  number”  can 
be  significant.  The  40  or  so  engine  failures  per  year  experienced  by  CP  amounted  to  several 
million  dollars  in  loss  and  it  was  thought  that  something  could  be  done  to  recover  these 
expenditures.  It  was  also  thought  that  more  effective  monitoring  would  lead  to  longer  PM 
intervals  with  a considerable  reduction  in  maintenance  costs. 

The  failure  modes  analysis  conducted  at  CP  during  the  mid  1980’s  indicated  that  oil 
contamination  and  metallic  wear  faults  were  the  most  prevalent  problems  and  would  provide  the 
highest  return  on  investment  for  an  oil  analysis  program.  The  symptoms  of  these  faults  include: 
coolant  contamination,  fuel  dilution,  metallic  wear,  incorrect  oil  addition  and  bad  sample 
recognition.  While  other  oil  related  faults  are  possible,  they  were  statistically  non-prevalent  in  the 
CP  locomotive  fleet.  Analysis  of  failure  mode  progression  intervals  suggested  that  the  sample 
interval  required  a significant  reduction  if  the  important  failure  modes  were  to  be  monitored 
reliably.  To  maximize  the  probability  of  early  fault  detection,  CP  shortened  the  sample  interval  to 
about 200 to 250 hours of engine operation, or about every 7 to 10 running days. This, however,
resulted in a three-to-four fold increase in the sample collection rate and a commensurate
increase  in  laboratory  labor  and  consumables’  costs.  Offsetting  this  increase  was  very  desirable 
and  expert  systems  were  investigated  as  a possible  solution. 

DEVELOPMENT  CONSIDERATIONS 

The  real  world  of  used-oil  analysis  presents  many  obstacles  to  the  expert  system 
developer.  Consider  the  following  problems  which  had  to  be  overcome  during  the  development 
of  the  CP  expert  system: 

1. When the CP system development started in 1984, very little documentation or published work
on "used oil" analysis was available. Consequently, the company had to document and validate all
of the factors and relationships involved in the used-oil analysis process. Since there were
multiple “expert” opinions and very little science, this was a very frustrating process. Taking
samples, applying some ASTM tests and reporting the data does not make a condition monitoring
program. One must have a plan, tailored to some objective such as failure prevention or
maintenance cost reduction.

2.  As  with  most  traditional  practitioners  of  used-oil  analysis,  CP  compared  individual  oil 
measurements  to  a set  of  empirical  limits  and  recommended  an  inspection  in  the  event  of  a limit 
exceedance. Little was known about the statistical behavior of sample data, the relationships
among  data  parameters,  the  relationships  between  data  and  fault  mechanisms,  or  the  impact  of 
operational  policies  on  data  variability.  Limits  were  arbitrarily  determined  by  an  “expert”  by 
somewhat  mysterious  procedures.  In  summary,  the  data  evaluation  procedure  depended  on  the 
intuitive response of a highly trained person. The procedure was not completely clear and could
not provide a reliable basis for an expert system. The entire data interpretation process had to be re-thought
and  developed  into  a general  purpose  paradigm  which  could  be  encoded  into  the  available  expert 
system  software. 

3.  Used-oil  sample  data  is  subject  to  frequent  change  from  both  reported  and  unreported  events. 
When  a technician  encounters  a fluctuating  data  pattern,  a variety  of  intellectual  processes  can  be 
utilized  for  interpretation  and  problem  resolution.  This  is  time  consuming  and  very  costly,  but 
generally  works.  However,  an  expert  system  does  not  readily  adapt  to  fluctuating  data  and  should 
only  be  used  where  data  variability  can  be  controlled,  or  the  interpretative  knowledge  can  be 
provided.  Fortunately,  analysis  of  several  years  of  test  and  maintenance  history  indicated  that  oil 
data  variability  can  be  brought  under  control  by  strict  adherence  to  corporate  standard  operating, 
maintenance,  sampling  and  testing  procedures.  Any  remaining  sample  data  variability  is  usually 
compensated for by an adaptive trending algorithm that is part of the expert system implementation.

4.  In  addition  to  evaluating  sample  data,  reliable  used-oil  analysis  recommendations  require 
access  to  and  the  evaluation  of,  a myriad  of  operational  factors  including  equipment  configuration 
change, maintenance and usage. However, a knowledge base that includes all of these factors, for
each different equipment model, function and operational circumstance, will be enormous,
difficult  to  develop  and  even  more  difficult  to  validate  and  maintain. 

A PRACTICAL  EXPERT  SYSTEM 

Consequently,  CP  decided  on  a general  purpose,  statistically  based  data  analysis  paradigm 
which  would  be  compact,  efficient  and  easy  to  maintain.  The  development  and  validation  of  this 
procedure  required  many  months  of  effort,  and  the  analysis  of  many  thousands  of  used-oil 
samples.  The  new  software  integrated  spectrographic,  water  contamination  and  oil  viscosity 
analyses  with  an  expert  system  for  data  interpretation.  The  expert  system  evaluated  all  relevant 
test  and  maintenance  data  and  significantly  improved  the  performance  and  consistency  of 
interpretation  while  eliminating  the  labor  involved.  In  early  1987,  a statistical  study  of  20,000 
samples  taken  from  1200  locomotives  over  a 3 month  period  verified  the  premise  that  simple 
analytical  tests  performed  at  a high  frequency  provided  reliable  indicators  of  engine  and  lubricant 
condition.  The  study  placed  the  effectiveness  of  the  CP  expert  system  at  98.6%,  with  no  engines 
missed (2). By comparison, the oil analysis data interpretation effectiveness before 1986 averaged
less  than  65%.  By  the  beginning  of  1988  oil-related  engine  failure  occurrences  at  CP  were  nearly 
eliminated.  In  addition,  the  company  moved  to  a reliability  centered  maintenance  (RCM)  program 
which  significantly  extended  component  life  utilization.  The  move  would  not  have  been  possible 
without  the  availability  of  up-to-date  accurate  equipment  condition  data.  Today  the  shortened 
interval,  consistent  analysis  and  quick  turnaround  time  permits  problem  engines  to  be  inspected 
and  repaired  early,  generating  substantial  savings  in  materials  and  labor. 

The  fact  that  the  original  CP  expert  system  is  still  in  operation,  and  without  modification 
since 1989, attests to its success. In fact, many other equipment operators, including Canadian
National Railways, Chinese National Railway, CSX Transportation, Royal Canadian Navy (3) and
Royal  Navy  have  implemented  oil  analysis  expert  systems  based  on  the  general  principles  first 
used  at  CP  Rail.  While  the  performance  statistics  of  the  military  applications  are  unknown,  it 
should  be  noted  that  Canadian  National  and  CSX  Transportation  also  recorded  marked 
improvements  in  locomotive  failure  rates  and  oil  utilization  after  their  oil  analysis  expert  systems 
were commissioned. In each case, a reduction in engine failures of 40% or greater was reported
at the end of the first year of operation. These benefits were a direct result of consistent, early
problem  indication,  and  improved  maintenance  scheduling. 

GENERAL  DATA  INTERPRETATION  PRINCIPLES 

Before  an  equipment  monitoring  process  can  be  automated  by  an  expert  system,  it  is 
necessary  to  completely  understand  the  equipment  system  in  terms  of  how  it  behaves;  its  inputs 
and outputs; its performance, reliability and cost factors, etc. It is also necessary to reduce this
information  into  a set  of  general  principles  which  can  be  easily  encoded  into  an  expert  system 
knowledge base (4,5). If this is not done, the knowledge base grows at an exponential rate as
individual  machine  related  rules  and  relationships  are  added.  Such  a knowledge  base  is  very 
complex  and  difficult  to  validate  or  maintain.  Fortunately,  statistical  process  control  (SPC) 
procedures  provide  a convenient,  compact  and  general  purpose  paradigm  for  an  expert  oil  analysis 
system. In an SPC based paradigm, the machinery train or equipment fleet is viewed as a “closed
loop”  system  as  shown  in  Figure  1 below. 


In  a closed  loop  system,  frequent  oil  samples  provide  condition  indicating  data  (feedback); 
sample  analysis  determines  the  individual  component  condition  and  standard  responses  to  direct 
machine  rehabilitation  (control).  Abnormal  sample  data  identifies  failing  components;  once 
identified, they are removed, repaired or replaced and cease to be contributors of abnormal
data;  oil  consumption,  make-up  addition  and  oil  change  function  to  restore  sample  data  to  normal 
levels.  Thus,  the  system  attains  a dynamic  equilibrium  and  under  normal  circumstances,  a 
relatively  normal  frequency  distribution. 
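
The equilibrium described above can be illustrated with a simple mass-balance sketch. The model and every number in it are assumptions made for illustration, not values from the CP program: wear metal enters the sump at a constant rate, consumed oil leaves at the current sump concentration, and clean make-up oil keeps the sump volume constant. The function name simulate_sump is hypothetical.

```python
def simulate_sump(wear_mg_per_h, makeup_l_per_h, sump_l, hours, start_mg_per_l=0.0):
    """Return the hourly wear-metal concentration (mg/L) in a constant-volume sump.

    Assumed model: metal is generated at a constant rate, and oil lost to
    consumption carries metal away at the current sump concentration before
    being replaced by clean make-up oil.
    """
    conc = start_mg_per_l
    history = []
    for _ in range(hours):
        metal_mg = conc * sump_l
        metal_mg += wear_mg_per_h            # metal added by wear this hour
        metal_mg -= makeup_l_per_h * conc    # metal leaving with consumed oil
        conc = metal_mg / sump_l
        history.append(conc)
    return history


# Illustrative values only: 1000 L sump, 1 L/h consumption and make-up.
healthy = simulate_sump(wear_mg_per_h=20.0, makeup_l_per_h=1.0, sump_l=1000.0, hours=10_000)
faulty = simulate_sump(wear_mg_per_h=40.0, makeup_l_per_h=1.0, sump_l=1000.0, hours=10_000)
```

Under these assumptions the concentration settles at the wear rate divided by the make-up rate, so a component that starts wearing faster shifts the equilibrium upward, which is the kind of deviation frequent sampling is meant to catch.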

When  the  machinery  system  is  maintained  by  consistent  procedures,  sample  data  also  tends 
to  be  consistent,  changing  only  as  a function  of  equipment  usage,  a fault  occurrence  or  a 
maintenance  action.  Since  condition-data  variability  due  to  usage  and  consistent  maintenance  is 
predictable, and usage and maintenance data is recorded, condition-data variability due to a
developing fault is easily identified. In fact, data deviations caused by some maintenance
procedures  can  be  monitored  by  the  occurrence  of  a particular  pattern  or  trend  and  provide  an 
additional  source  of  useful  information  to  the  maintenance  manager.  In  addition,  historical 
condition-data  from  a normally  operating  equipment  system  can  be  utilized  to  calculate  statistical 
alarm  limits  for  each  fault  signature  monitored.  Thus,  an  SPC  based  condition  monitoring  system 
is  simple,  general  purpose,  machine  independent,  easy  to  automate  with  an  expert  system,  and 
easy  to  maintain  once  deployed.  However,  condition-data  will  only  exhibit  a “normal” 
distribution  if  proper,  consistent  operating  practices  are  implemented.  Any  improper  practice 
such  as  incorrect  sampling,  improper  maintenance,  excessive  oil  change,  permitting  deep  sumps  to 
run  low,  etc.  will  profoundly  affect  system  equilibrium  and  data  distribution,  with  a commensurate 
effect  on  the  reliability  of  sample  data  interpretation  and  alarm  limit  calculations. 
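
As a minimal sketch of how historical condition-data might be turned into statistical alarm limits, the example below sets each limit at the baseline mean plus a multiple of the standard deviation. The paper does not state the formula actually used; the mean-plus-k-sigma rule, the multipliers and the function name alarm_limits are assumptions for illustration.

```python
from statistics import mean, stdev


def alarm_limits(baseline, multipliers=None):
    """Compute upper alarm limits from readings of normally operating equipment.

    Each limit is mean + k * sample standard deviation. The k values are
    hypothetical and would be tuned to the operator's alarm policy.
    """
    if multipliers is None:
        multipliers = {"Marginal": 2.0, "Reportable": 3.0}
    centre = mean(baseline)
    spread = stdev(baseline)
    return {name: centre + k * spread for name, k in multipliers.items()}


# Illustrative baseline readings (for example, iron in ppm) from healthy units.
limits = alarm_limits([10, 12, 11, 9, 13, 10, 11, 12])
```

Because the limits are derived from the fleet's own history, the same routine applies to any parameter or machine type, provided the baseline was collected under consistent operating practices.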

STATE  OF  THE  ART 


Oil  data  interpretation  can  be  performed  by  a simple  set  of  data  driven  procedures  using  a 
divide  and  conquer  procedure.  The  evaluation  problem  is  divided  into  four  main  procedures,  each 
with its associated databases and rules as shown in Figure 2:


Figure 2 (diagram not reproduced; surviving labels: Recommendation, Results)

Evaluation Procedure 1 prepares raw data and converts it into symbolic text or status values that
carry the meaning imparted by the data. This process uses current sample, historical and
statistical data to completely define a parameter’s meaning in simple language terms. In this
process,  each  input  parameter  is  considered  an  object  defined  by  a series  of  attributes.  The 
symbolic  value  of  each  attribute  is  indicated  by  plain  language,  domain  related  terms.  For 
example: 

Input Parameter: Percent Water

Descriptive Attributes:

    Level: {Normal, Marginal, Reportable}
    Trend: {Decreasing, Stable, Increasing, etc.}
    Rank:  {Hi-Reader, Nominal, Low-Reader}

For  simplicity,  all  input  data  is  defined  by  the  same  attributes  and  each  attribute’s  value  is 
established by statistically based limits. Attribute text phrases are chosen to impart real meaning
in  human  terms. 
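
The conversion from raw values to symbolic attributes might be sketched as follows. The attribute names and values come from the example above; the threshold logic, the interpretation of Rank as a comparison against fleet-wide bounds, and all function and key names are assumptions for illustration.

```python
from statistics import mean


def level(value, marginal, reportable):
    """Classify magnitude against two statistically derived limits."""
    if value >= reportable:
        return "Reportable"
    if value >= marginal:
        return "Marginal"
    return "Normal"


def trend(value, history, tolerance):
    """Compare the current value with the mean of earlier samples (non-empty)."""
    delta = value - mean(history)
    if delta > tolerance:
        return "Increasing"
    if delta < -tolerance:
        return "Decreasing"
    return "Stable"


def rank(value, fleet_low, fleet_high):
    """Flag a unit that reads high or low relative to assumed fleet bounds."""
    if value > fleet_high:
        return "Hi-Reader"
    if value < fleet_low:
        return "Low-Reader"
    return "Nominal"


def describe(name, value, history, limits):
    """Return the parameter as an object defined by symbolic attributes."""
    return {
        "Parameter": name,
        "Level": level(value, limits["marginal"], limits["reportable"]),
        "Trend": trend(value, history, limits["trend_tolerance"]),
        "Rank": rank(value, limits["fleet_low"], limits["fleet_high"]),
    }


# Hypothetical limits and readings for Percent Water.
water_limits = {"marginal": 0.10, "reportable": 0.20, "trend_tolerance": 0.02,
                "fleet_low": 0.01, "fleet_high": 0.15}
water = describe("Percent Water", 0.25, [0.05, 0.06, 0.05], water_limits)
```

Every parameter passes through the same three classifiers, so adding a new test to the program means supplying limits rather than writing new rules.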

Evaluation  Procedure  2 combines  parameter  attributes  to  generate  an  overall  Condition  Indicator 
Status. The status of the condition indicator is related to the maintenance response required by
corporate policy should the indicated abnormal status of the parameter be true. For example:

Condition Indicator: Percent Water

Descriptive Attributes: Normal, Alert, Urgent, Hazard, Danger

    Normal   No Action Required, Continue Routine Sampling
    Alert    Shorten Sample Interval
    Urgent   Maintenance Recommended, Deferral Permitted
    Hazard   Maintenance Required, No Deferral Permitted
    Danger   Shut Machine Down, Immediate Maintenance Required

Since  a condition  indicator  can  sometimes  indicate  an  abnormal  state  for  other  reasons  such  as 
false  positives,  bad  samples,  multiple  occurring  fault  signatures,  improper  maintenance, 
unreported  maintenance  or  the  lack  of  maintenance,  other  evaluation  steps  are  necessary  to  ensure 
an accurate response to any indicated abnormal status, and to eliminate any potential false positive.
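
A minimal sketch of Evaluation Procedure 2 follows. The five statuses and their responses are taken from the table above; the rule that combines Level and Trend into a status (the function indicator_status and its scoring) is a hypothetical example, since the paper does not give the actual combination logic.

```python
# Status names and responses as listed in the Percent Water example.
RESPONSES = {
    "Normal": "No Action Required, Continue Routine Sampling",
    "Alert": "Shorten Sample Interval",
    "Urgent": "Maintenance Recommended, Deferral Permitted",
    "Hazard": "Maintenance Required, No Deferral Permitted",
    "Danger": "Shut Machine Down, Immediate Maintenance Required",
}
STATUS_ORDER = list(RESPONSES)  # ordered from least to most severe


def indicator_status(level, trend):
    """Combine Level and Trend attributes into one condition indicator status.

    Hypothetical rule: the level sets a base severity and an increasing
    trend raises it by one step, capped at the most severe status.
    """
    severity = {"Normal": 0, "Marginal": 1, "Reportable": 3}[level]
    if trend == "Increasing":
        severity += 1
    return STATUS_ORDER[min(severity, len(STATUS_ORDER) - 1)]


status = indicator_status("Marginal", "Increasing")
response = RESPONSES[status]
```

Keeping the responses in a plain lookup table means maintenance policy can change without touching the evaluation logic.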

Evaluation  Procedure  3 compares  all  abnormal  condition  indicator  statuses  to  a library  of  fault 
signatures  to  arrive  at  a diagnosis.  A positive  diagnosis  increases  the  certainty  that  an  abnormal 
data  indication  is  justifiably  abnormal  and  rates  a maintenance  response.  The  fault  signature 
library  contains  signatures  for  all  known  faults,  bad  sample  indications,  false  positive  indications, 
inappropriate  trend  indications  or  any  known  symptom  which  could  impact  the  accuracy  or 
reliability  of  an  intended  maintenance  response. 
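
Matching abnormal indicators against a fault signature library might look like the sketch below. The fault names echo the symptoms listed earlier in the paper, but the indicator names, the signatures themselves and the subset-matching rule are assumptions for illustration, not the validated CP fault library.

```python
# Hypothetical library: each fault maps to the set of condition indicators
# that must all be abnormal for the signature to match.
FAULT_LIBRARY = {
    "Coolant contamination": {"Percent Water", "Sodium"},
    "Fuel dilution": {"Viscosity Low"},
    "Metallic wear": {"Iron", "Lead"},
}


def diagnose(statuses, library):
    """Return the faults whose full signature is present, most specific first."""
    abnormal = {name for name, status in statuses.items() if status != "Normal"}
    matches = [fault for fault, signature in library.items() if signature <= abnormal]
    matches.sort(key=lambda fault: len(library[fault]), reverse=True)
    return matches


sample = {"Percent Water": "Urgent", "Sodium": "Alert",
          "Viscosity Low": "Normal", "Iron": "Normal", "Lead": "Normal"}
diagnosis = diagnose(sample, FAULT_LIBRARY)

# An abnormal indicator with no matching signature yields no diagnosis,
# which a caller could treat as a possible false positive or bad sample.
unconfirmed = diagnose({"Iron": "Alert", "Lead": "Normal"}, FAULT_LIBRARY)
```

In this sketch bad-sample and false-positive patterns would simply be further entries in the same library, as the text describes.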

Evaluation  Procedure  4 combines  diagnosis  and  condition  indicator  status  levels  and  generates  an 
overall risk of failure indication. This module searches the user’s maintenance database and scheduled
maintenance to-do lists for factors which would alter a maintenance recommendation based solely
on  condition  data.  For  example,  it  would  be  more  desirable  to  inspect  or  repair  a fault  at  a PM 
interval, and knowledge of the next PM may prevent a request for shutdown and repair of a
machine.  Similarly,  knowledge  of  recent  component  and  lubricant  maintenance  would  refine  the 
response  generated  by  a purely  condition  based  recommendation.  Once  the  maintenance  data  is 
known, a set of corporate business rules determines the appropriate maintenance response
which  is  printed  on  the  output  report. 
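
The way maintenance data can modify a purely condition-based recommendation might be sketched as below. The rules, the 14-day deferral window and the function name recommend are hypothetical; actual business rules would come from each operator's maintenance policy.

```python
def recommend(status, diagnosis, days_to_next_pm, deferral_window_days=14):
    """Turn a condition status and diagnosis into a maintenance recommendation.

    Hypothetical business rules: an Urgent finding is deferred to the next
    preventive maintenance (PM) visit when that visit is close enough, and
    an abnormal status with no matching fault signature triggers a resample.
    """
    if status == "Normal":
        return "No action required, continue routine sampling"
    if status == "Danger":
        return "Shut machine down, immediate maintenance required"
    if not diagnosis:
        return "Resample to confirm, no fault signature matched"
    fault = diagnosis[0]
    if status == "Hazard":
        return f"Maintenance required, no deferral: {fault}"
    if status == "Urgent":
        if days_to_next_pm <= deferral_window_days:
            return f"Inspect at next scheduled PM: {fault}"
        return f"Schedule inspection: {fault}"
    return "Shorten sample interval"  # remaining case: Alert


near_pm = recommend("Urgent", ["Coolant contamination"], days_to_next_pm=5)
far_pm = recommend("Urgent", ["Coolant contamination"], days_to_next_pm=60)
```

The same condition data produces different advice depending on the maintenance schedule, which is the refinement this procedure is meant to provide.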

An  expert  system  based  on  an  SPC  based  paradigm,  such  as  the  one  shown  in  Figure  2 
above,  can  evaluate  sample  data,  rapidly  and  consistently,  and  return  reliable  recommendations  for 
all samples where there is a high certainty of outcome. Fortunately, this is over 98% of all samples.
In  the  few  cases  where  there  is  incomplete  data  or  unusual  external  factors  limiting  the  certainty 
of  interpretation,  the  expert  system  can  request  re-tests,  additional  tests,  additional  samples,  or 
assistance. 

The  expert  systems  used  by  Canadian  Pacific,  Canadian  National  and  CSX  Transportation 
are  excellent  examples  of  the  level  of  reliability  that  can  be  achieved  using  a statistically  based 
data  interpretation  paradigm.  For  example,  the  CN  and  CSX  systems  operate  completely 
unattended,  transmitting  recommendations  directly  to  the  railways’  maintenance  shops.  The  CSX 
System  processes  over  700  samples  per  day  and  supports  the  integrity  of  a fleet  of  3000 
locomotives  providing  maximum  reliability  at  a substantial  saving  in  laboratory  labor.  This  level 
of  success  can  be  easily  achieved  if  the  following  development  considerations  are  followed: 

1.  Perform  a failure  modes,  effects  and  criticality  analysis  for  each  machine  type  to  determine  the 
failure  modes  which  are  most  damaging  to  reliable  equipment  operations,  which  are  economical  to 
monitor,  their  respective  fault  mode  symptoms  (fault  signatures)  and  the  appropriate  sampling 
interval. 

2. Select the required analytical tests to be performed on each sample from an evaluation of
failure mode symptoms. Only tests which economically provide failure symptom data need to be
considered. Use the failure modes analysis as the primary guidance in the selection of tests. Be
wary of traditional testing conventions; many traditional tests were developed for new-oil
performance  testing,  not  used-oil  condition  monitoring.  Data  parameters  that  do  not  relate  to 
fault  indicators  or  bad  sample/test  indicators  only  consume  resources  with  little  probability  of 
generating  a return. 

3.  Develop  a structured  set  of  alarm  responses  in  accordance  with  maintenance  and  operational 
policy.  These  responses  dictate  the  maintenance  measures  to  be  taken  when  a particular  alarm 
level  is  encountered. 

4.  Utilize  SPC  procedures  to  calculate  an  appropriate  alarm  limit  corresponding  to  each  alarm 
status  response.  Magnitude,  trend  and  other  statistical  evaluations  may  be  combined  to  achieve 
the  desired  evaluation  matrix. 

5.  Develop  and  validate  a structured  set  of  condition  indicator/failure  mode  relationships.  These 
diagnostic  indicators  are  encoded  into  the  expert  system  knowledge  base  (fault  library)  to  provide 
the  basis  for  fault  diagnosis  or  verification. 

Lastly, integrate the evaluation steps into a logical paradigm such as indicated in Figure 2 above.
Divide  the  data  evaluation  problem  into  simple  discrete  steps,  in  which  simple  plain  language 
analysis  rules  can  be  used  to  solve  each  step.  This  design  provides  high  run-time  performance  and 
is  easy  to  modify  and  maintain  over  the  long  term. 

Note:  system  maintenance  and  validation  procedures  are  often  performed  by  field  grade 
personnel.  It  will  be  very  helpful  to  have  the  expert  system  knowledge  base  and  any  mathematical 
formulae  encoded  in  plain  domain  related  language  for  easy  reading  and  understanding. 

CONCLUSION 

The  development  and  operation  of  the  oil  analysis  expert  systems  over  the  past  10  years 
demonstrates  that  expert  system  technology  is  mature  and  can  be  used  effectively  for  used-oil 
analysis automation. The systems implemented at major North American railways also indicate
the level of sophistication that can be achieved with low cost expert system technology (4,5). The
major lesson learned from the development of these systems is that the successful
development of a used-oil analysis expert system requires the establishment of:

1.  simple  and  reliable  'condition  indicators'  of  equipment  failure  mechanisms  based  on  oil  wear 
metal  and  contamination  data, 

2.  statistically  based  limits  for  the  magnitude  and  trend  of  each  condition  indicator, 

3.  validated  fault  signatures  indicating  the  relationships  between  condition  indicator  variations  (or 
combinations) and specific component failure mechanisms, and

4.  application  of  consistent  operational  and  maintenance  practices  to  ensure  high  data  integrity. 